Linear Classification

Part 1

Task 1

Necessary imports

Classifier implementation

Task 2

Definition of constants and utilities

Dataset generation, splitting, and normalization
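A minimal sketch of this step. The generator (`make_classification`), the sample count, the 80/20 split, and `StandardScaler` are all assumptions for illustration; the notebook may generate and normalize its datasets differently.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumed generator: a synthetic 2-feature, 2-class dataset.
X, y = make_classification(
    n_samples=200,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=0,
)

# Assumed 80/20 split into training and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the scaler on the training part only, then apply it to both parts,
# so that no information from the test part leaks into the normalization.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler on the training part alone means the test part is transformed with the training mean and standard deviation, so its own mean is generally not exactly zero.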

Presentation of the generated training datasets

Task 3

Additional imports

Perform classification

Display metrics
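One possible way to compute and print the usual classification metrics with `sklearn.metrics`. The label vectors below are made up for illustration, and the choice of metrics is an assumption rather than the notebook's actual selection.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical true labels and predictions, for illustration only.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}

# Rows of the confusion matrix are true classes, columns are predictions.
cm = confusion_matrix(y_true, y_pred)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(cm)
```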

Display how the classifier splits the feature space

Conclusions

Part 2

Task 1

Additional imports

Download the dataset and load it into a dataframe

Task 2

Notes on the attributes, based on the provided documentation. Each number corresponds to the column index in the dataframe (counting from 1), and each type is determined from the documentation.

  1. age: age in years; type=(integer, continuous)
  2. sex: sex; type=(integer, binary); values={1: male, 0: female}
  3. cp: chest pain type; type=(integer, categorical); values={1: typical angina, 2: atypical angina, 3: non-anginal pain, 4: asymptomatic}
  4. trestbps: resting blood pressure (in mm Hg on admission to the hospital); type=(float, continuous)
  5. chol: serum cholesterol in mg/dl; type=(float, continuous)
  6. fbs: fasting blood sugar > 120 mg/dl; type=(integer, binary); values={1: true, 0: false}
  7. restecg: resting electrocardiographic results; type=(integer, categorical); values={0: normal, 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), 2: showing probable or definite left ventricular hypertrophy by Estes' criteria}
  8. thalach: maximum heart rate achieved; type=(float, continuous)
  9. exang: exercise induced angina; type=(integer, binary); values={1: yes, 0: no}
  10. oldpeak: ST depression induced by exercise relative to rest; type=(float, continuous)
  11. slope: the slope of the peak exercise ST segment; type=(integer, categorical); values={1: upsloping, 2: flat, 3: downsloping}
  12. ca: number of major vessels colored by fluoroscopy; type=(integer, categorical); values={0, 1, 2, 3}
  13. thal: Thallium stress test indicator; type=(integer, categorical); values={3: normal, 6: fixed defect, 7: reversible defect}
  14. class_diagnosis (originally: 'num'): diagnosis of heart disease (angiographic disease status); type=(integer, binary); values={0: Negative, 1,2,3,4: Positive}
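The last item maps the original five-valued 'num' column to a binary target. A minimal sketch of that mapping, using a made-up sample of values rather than the real data:

```python
import pandas as pd

# Hypothetical sample of the original 'num' column (values 0 to 4).
num = pd.Series([0, 2, 1, 0, 4, 3], name="num")

# 0 stays negative (0); any of 1, 2, 3, 4 becomes positive (1).
class_diagnosis = (num > 0).astype(int).rename("class_diagnosis")
```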

Replace '?' with NaN
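A sketch of this conversion on a small made-up frame. It assumes the affected columns were read as text because of the '?' markers; the column contents are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical excerpt: both columns hold strings because of the '?' markers.
df = pd.DataFrame({"ca": ["0.0", "?", "2.0"], "thal": ["3.0", "7.0", "?"]})

# Turn the '?' markers into proper missing values ...
df = df.replace("?", np.nan)

# ... and parse the remaining strings as numbers, column by column.
df = df.apply(pd.to_numeric)
```

An alternative, if the file is read with `pandas.read_csv`, is to pass `na_values="?"` so the markers become NaN at load time.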

Count missing values

Remove rows with missing values
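Counting and then dropping the missing values can be sketched as follows, again on a hypothetical frame rather than the actual dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two incomplete rows.
df = pd.DataFrame(
    {
        "age": [63.0, 67.0, 37.0, 41.0],
        "ca": [0.0, np.nan, 0.0, 1.0],
        "thal": [6.0, 3.0, np.nan, 3.0],
    }
)

# Number of missing values in each column.
missing_per_column = df.isna().sum()

# Keep only the rows in which every attribute is present.
df_clean = df.dropna().reset_index(drop=True)
```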

Convert floats to integers where necessary (categorical and binary attributes)
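A sketch of the type conversion. The list of columns treated as integer-valued here is a hypothetical subset chosen for illustration; continuous attributes are left as floats.

```python
import pandas as pd

# Hypothetical excerpt after missing rows have been removed.
df = pd.DataFrame(
    {"sex": [1.0, 0.0], "cp": [4.0, 3.0], "oldpeak": [2.3, 1.5]}
)

# Assumed subset of binary/categorical columns to store as integers.
integer_columns = ["sex", "cp"]
df = df.astype({column: int for column in integer_columns})
```

The cast has to happen after missing values are removed, because a NumPy integer column cannot hold NaN.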

Split the dataset into training and test sets
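A minimal sketch of the split using `sklearn.model_selection.train_test_split`. The 80/20 proportion, the stratification on the target, and the fixed seed are assumptions; the data is made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame with a balanced binary target.
df = pd.DataFrame(
    {"age": list(range(30, 50)), "class_diagnosis": [0, 1] * 10}
)

# Stratifying keeps the class proportions equal in both parts (assumed choice).
train_df, test_df = train_test_split(
    df,
    test_size=0.2,
    stratify=df["class_diagnosis"],
    random_state=0,
)
```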

The remaining operations are performed on the training dataset

Display the dependency of 'class_diagnosis' on the categorical/binary attributes using nested pie charts

Display histograms of the continuous attributes

Display the correlation matrix of the dataset

Compute statistical metrics for attributes

Which attributes seem useful for the classification problem?

Task 3

Additional imports

Selection according to correlation matrix
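One way to turn the correlation matrix into a feature selection is to rank attributes by the absolute value of their correlation with the target and keep the top `k`. The frame, the value of `k`, and the ranking rule are assumptions for illustration, not necessarily the notebook's criterion.

```python
import pandas as pd

# Hypothetical training excerpt (values are made up).
train_df = pd.DataFrame(
    {
        "age": [63, 67, 37, 41, 56, 62, 57, 44],
        "thalach": [150, 108, 187, 172, 178, 160, 163, 173],
        "chol": [233, 286, 250, 204, 236, 268, 354, 263],
        "class_diagnosis": [0, 1, 0, 0, 0, 1, 0, 0],
    }
)

target = "class_diagnosis"
k = 2  # assumed number of features to keep

# Pearson correlation of every attribute with the target (target itself dropped).
correlations = train_df.corr()[target].drop(target)

# Rank by absolute correlation, strongest first, and keep the top k names.
selected = (
    correlations.abs().sort_values(ascending=False).head(k).index.tolist()
)
```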

sklearn.feature_selection.SelectKBest using sklearn.feature_selection.chi2 metric
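A minimal sketch of `SelectKBest` with the `chi2` score on synthetic, non-negative integer features (the chi-squared score requires non-negative inputs). The data and `k=1` are assumptions made so the result is easy to inspect.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)

# Synthetic target: 50 negatives followed by 50 positives.
y = np.array([0] * 50 + [1] * 50)

# One feature that depends strongly on y, and two pure-noise features.
informative = y * 3 + rng.integers(0, 2, size=100)
noise = rng.integers(0, 4, size=(100, 2))
X = np.column_stack([noise[:, 0], informative, noise[:, 1]])

# Keep the single feature with the highest chi-squared score.
selector = SelectKBest(score_func=chi2, k=1).fit(X, y)
selected = selector.get_support(indices=True)
```

Here `selected` holds the column index of the informative feature, and `selector.scores_` exposes the score of every column.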

sklearn.feature_selection.SequentialFeatureSelector
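A sketch of forward sequential selection. The wrapped estimator (`LogisticRegression`), the synthetic data, the number of features to keep, and the 5-fold cross-validation are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data: 6 features, of which 2 are informative.
X, y = make_classification(
    n_samples=150, n_features=6, n_informative=2, n_redundant=0, random_state=0
)

# Greedily add the feature that most improves the cross-validated score
# until 2 features are selected.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="forward",
    cv=5,
)
selector.fit(X, y)

selected_mask = selector.get_support()
X_selected = selector.transform(X)
```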

sklearn.feature_selection.RFE
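A sketch of recursive feature elimination. As with the other selectors, the estimator, the synthetic data, and the target of 2 features are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 6 features, of which 2 are informative.
X, y = make_classification(
    n_samples=150, n_features=6, n_informative=2, n_redundant=0, random_state=0
)

# Repeatedly fit the estimator and drop the weakest feature (step=1)
# until only 2 features remain.
rfe = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    step=1,
).fit(X, y)

selected_mask = rfe.support_   # True for the kept features
ranking = rfe.ranking_         # 1 for kept features, larger = dropped earlier
```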

Do the analyzed algorithms select the same set of features?

Classification

Conclusions